Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition
We propose a novel approach to semi-supervised automatic speech recognition
(ASR). We first exploit a large amount of unlabeled audio data via
representation learning, where we reconstruct a temporal slice of filterbank
features from past and future context frames. The resulting deep contextualized
acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end
ASR system using a smaller amount of labeled audio data. In our experiments, we
show that systems trained on DeCoAR consistently outperform ones trained on
conventional filterbank features, giving 42% and 19% relative improvement over
the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our
approach can drastically reduce the amount of labeled data required;
unsupervised pretraining on LibriSpeech followed by supervised training with
100 hours of labeled data achieves performance on par with training directly
on all 960 hours.
Pre-trained models and code will be released online.
Comment: Accepted to ICASSP 2020 (oral)
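The pretraining objective above — reconstructing a temporal slice of filterbank features from past and future context frames — can be sketched as follows. This is a toy illustration, not the paper's implementation: the `naive_predict` stand-in (averaging the nearest context frames) replaces the deep bidirectional network DeCoAR actually trains, and all shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a (T, 40)-dim log-filterbank feature sequence.
T, F = 20, 40
feats = rng.standard_normal((T, F))

def reconstruction_loss(feats, t, slice_len, predict):
    """L1 loss for reconstructing the slice feats[t : t+slice_len]
    from the past (feats[:t]) and future (feats[t+slice_len:]) frames."""
    past, future = feats[:t], feats[t + slice_len:]
    target = feats[t : t + slice_len]
    pred = predict(past, future, slice_len)
    return np.abs(pred - target).mean()

# Hypothetical "predictor": average the nearest past and future frames and
# broadcast over the slice; a real model would use learned deep networks.
def naive_predict(past, future, slice_len):
    ctx = (past[-1] + future[0]) / 2.0
    return np.tile(ctx, (slice_len, 1))

loss = reconstruction_loss(feats, t=8, slice_len=4, predict=naive_predict)
print(float(loss))
```

Because the target slice is held out and predicted only from surrounding context, minimizing this loss forces the representation to encode contextual acoustic structure, which is what makes the learned features useful downstream.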
On lexical level matching
In many natural language understanding applications, text processing requires comparing lexical units: words, phrases, named entities, and sentences. A significant amount of research has gone into studying and evaluating similarity metrics between those units. In this thesis, we summarize research on computing lexical similarity and describe a new approach to computing the similarity between two spans of text, using multiple semantic-unit-level comparison measures to compute sentence-level similarity scores.
Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition
Pretrained contextual word representations in NLP have greatly improved
performance on various downstream tasks. For speech, we propose contextual
frame representations that capture phonetic information at the acoustic frame
level and can be used for utterance-level language, speaker, and speech
recognition. These representations come from the frame-wise intermediate
representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken
utterances. We first train the model on the Fisher English corpus with
context-independent phoneme labels, then use its representations at inference
time as features for task-specific models on the NIST LRE07 closed-set language
recognition task and a Fisher speaker recognition task, giving significant
improvements over the state-of-the-art on both (e.g., language EER of 4.68% on
3sec utterances, 23% relative reduction in speaker EER). Results remain
competitive when using a novel dilated convolutional model for language
recognition, or when ASR pretraining is done with character labels only.
Comment: submitted to INTERSPEECH 201
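The step from frame-level to utterance-level in this abstract — taking frame-wise intermediate representations and turning them into fixed-size inputs for language/speaker classifiers — can be sketched minimally. Everything here (the 512-dim width, mean pooling as the aggregation) is an assumed placeholder for whatever the task-specific model actually consumes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical frame-wise representations from an intermediate ASR layer:
# one (T, D) array per utterance, with T varying by utterance length.
utterances = [rng.standard_normal((t, 512)) for t in (120, 87, 301)]

def utterance_embedding(frames):
    """Collapse frame-level features into a fixed-size utterance vector
    by mean pooling over time (one simple aggregation choice)."""
    return frames.mean(axis=0)

# Variable-length utterances now become equal-size vectors that a
# downstream language- or speaker-recognition backend can score.
embeddings = np.stack([utterance_embedding(u) for u in utterances])
print(embeddings.shape)
```

The key property is that the frozen pretrained ASR model does the phonetic feature extraction once, and only the small pooled-vector classifier is trained per task.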
Hybrid Attention-based Encoder-decoder Model for Efficient Language Model Adaptation
The attention-based encoder-decoder (AED) speech recognition model has been
widely successful in recent years. However, jointly optimizing the acoustic
model and language model in an end-to-end manner creates challenges for text
adaptation. In particular, adapting to new text effectively, quickly, and
inexpensively has become a primary concern for deploying AED systems in
industry. To address this issue, we propose a novel model, the hybrid
attention-based encoder-decoder (HAED) speech recognition model, which
preserves the modularity of conventional hybrid automatic speech recognition
systems. Our HAED model separates the acoustic and language models, allowing
conventional text-based language model adaptation techniques to be applied.
We demonstrate that the proposed HAED model yields a 21% relative Word Error
Rate (WER) improvement when out-of-domain text data is used for language
model adaptation, with only a minor WER degradation on a general test set
compared with a conventional AED model.
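The practical payoff of separating the acoustic and language models can be sketched with a single decoding step. This toy example is an assumption-laden illustration of the general modular-decoding idea (log-linear combination of acoustic and LM scores with a weight), not HAED's specific formulation; all scores, words, and the 0.3 weight are made up:

```python
# Toy scores for one decoding step over a three-word vocabulary.
vocab = ["the", "a", "cat"]
acoustic_logp = {"the": -1.0, "a": -1.5, "cat": -0.9}   # from acoustic model
source_lm_logp = {"the": -0.7, "a": -1.2, "cat": -3.0}  # general-domain LM
adapted_lm_logp = {"the": -0.9, "a": -1.1, "cat": -0.4} # text-adapted LM

def rescore(word, lm_logp, lm_weight=0.3):
    """Because the LM is a separate module, swapping in an adapted lm_logp
    changes the decoding scores without touching the acoustic model."""
    return acoustic_logp[word] + lm_weight * lm_logp[word]

best_before = max(vocab, key=lambda w: rescore(w, source_lm_logp))
best_after = max(vocab, key=lambda w: rescore(w, adapted_lm_logp))
print(best_before, best_after)  # prints "the cat"
```

With a joint end-to-end AED model, the LM knowledge is entangled in the decoder weights, so this kind of cheap text-only swap is not available; modularity is what makes text-based adaptation inexpensive.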
Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition
Most end-to-end (E2E) speech recognition models are composed of encoder and
decoder blocks that perform acoustic and language modeling functions.
Pretrained large language models (LLMs) have the potential to improve the
performance of E2E ASR. However, integrating a pretrained language model into
an E2E speech recognition model has shown limited benefits due to the
mismatches between text-based LLMs and those used in E2E ASR. In this paper, we
explore an alternative approach by adapting a pretrained LLMs to speech. Our
experiments on fully-formatted E2E ASR transcription tasks across various
domains demonstrate that our approach can effectively leverage the strengths of
pretrained LLMs to produce more readable ASR transcriptions. Our model, which
is based on the pretrained large language models with either an encoder-decoder
or decoder-only structure, surpasses strong ASR models such as Whisper, in
terms of recognition error rate, considering formats like punctuation and
capitalization as well
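One common pattern for adapting a text LLM to speech is to bridge acoustic encoder outputs into the LLM's embedding space. The sketch below shows that bridging shape-wise only; the subsample-then-project recipe, the dimensions, and every name here are assumptions for illustration, not the paper's described architecture:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical shapes: an acoustic encoder emits (T, 256) frame states;
# the pretrained LM expects (T', 1024) token-like embeddings.
T, d_audio, d_lm = 50, 256, 1024
audio_states = rng.standard_normal((T, d_audio))

# Assumed bridging module: subsample frames to a token-like rate, then
# linearly project into the LM width (the projection would be learned).
proj = rng.standard_normal((d_audio, d_lm)) * 0.02

def bridge(audio_states, stride=4):
    pooled = audio_states[::stride]  # reduce the frame rate
    return pooled @ proj             # map into the LM embedding space

lm_inputs = bridge(audio_states)
print(lm_inputs.shape)
```

Once acoustic states live in the LM's input space, the decoder can attend to them like a (long) prompt, which is what lets the pretrained LM's formatting knowledge — punctuation, capitalization — carry over into the transcription.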
Induction chemotherapy‐based organ‐preservation protocol improves function preservation compared with immediate total laryngectomy for locally advanced hypopharyngeal cancer—Results of a matched‐pair analysis
Abstract Background We performed a matched-pair analysis to compare the therapeutic effect of the induction chemotherapy‐based organ‐preservation approach with that of immediate total laryngectomy in hypopharyngeal squamous cell carcinoma patients requiring total laryngectomy. Methods A total of 351 patients treated with the organ‐preservation approach were compared with 110 patients treated with total laryngectomy. The main outcome measures were progression‐free survival (PFS), overall survival (OS), and larynx function preservation survival (LFPS). Results No statistical difference was observed in 3‐, 5‐, or 10‐year PFS or OS between the two groups. In the organ‐preservation group, the 3‐, 5‐, and 10‐year LFPS was 30.7%, 23.3%, and 16.6%, respectively. LFPS decreased in the order Stage III > Stage IV, N0 > N1 > N2 > N3, T2 > T3 > T4, and CR > PR > SD > PD (all p values <0.05). Conclusions Survival outcomes did not significantly differ between the two groups. The organ‐preservation approach allowed more than 70% of the survivors to retain their larynx function.
LGR5 marks targetable tumor-initiating cells in mouse liver cancer
Cancer stem cells (CSCs) or tumor-initiating cells (TICs) are thought to be the main drivers of disease progression and treatment resistance across various cancer types. Identifying and targeting these rare cancer cells, however, remains challenging with respect to therapeutic benefit. Here, we report the enrichment of cells expressing LGR5, a well-recognized stem cell marker, in mouse liver tumors, and the upregulation of LGR5 expression in human hepatocellular carcinoma. LGR5-expressing cells isolated from mouse liver tumors are superior in initiating organoids and forming tumors upon engraftment, featuring candidate TICs. These cells are resistant to conventional treatments, including sorafenib and 5-FU. Importantly, LGR5 lineage ablation significantly inhibits organoid initiation and tumor growth. Combining LGR5 ablation with 5-FU, but not sorafenib, further augments the therapeutic efficacy in vivo. Thus, we have identified the LGR5+ compartment as an important TIC population, representing a viable therapeutic target for combating liver cancer.